Utilizing regular expressions for instance-based schema matching
Abstract
Statistical data consists mostly of numerical values and entries of codelists, such as country codes or acronyms for gender. Such values typically follow specific patterns. In this paper we present a novel approach for instance-based schema matching, in which regular expressions are utilized for matching patterns of instance values.

1 Motivation and Background

In various domains, e.g. the social sciences, the matching of statistical data is a typical task. Schema elements of statistical data, e.g. rows or columns of a spreadsheet, are usually named by simple and short labels, sometimes even by abbreviated terms. However, the structure and semantics of their instances (e.g. numerical values, entries of codelists) differ in various aspects from text-heavy data. Instances often follow a specific syntactical pattern, e.g. dates consist of numerical values separated by periods or slashes, and geographical areas are given as three-letter codes.

For instance-based schema matching, [3] states that different domains reveal new challenges such as treating new types of information resources, e.g. spatial or temporal information, or domain-specific constraints. According to [2], domain-specific values, significant occurrences and patterns of values are especially relevant characteristics to be considered at instance level, as are integrity constraints for schema elements and their instance values. In [1] the matching process is enhanced by applying constraint-based matching. Moreover, regular expressions and catchwords are considered for instance-based schema matching in [4]. We focus on statistical data, where the potential of patterns and regular expressions for schema matching can be fully exploited.

2 Schema Matching using Regular Expressions

Using pattern classes, our approach considers two schema elements a match if their instances can be expressed via at least one regular expression of the same pattern class.
We define multiple pattern classes, each of which corresponds to a specific data element, e.g. dates, age groups or geographical codes, and contains various patterns for describing this data element. For the data element "date", different patterns might be e.g. [0-9]{4}, [0-9]{2}-[0-9]{4} or [0-9]{2}.[0-9]{2}.[0-9]{4}. Each pattern is expressed as a regular expression and is assigned a weighting, which states how accurately the pattern covers typical instances of the data element. Inside a pattern class the regular expressions are sorted by their weightings in descending order.

We assume two datasets $M$ and $N$ with their schema elements $S_M \in M$ and $S_N \in N$. The pattern classes $C_x$ with $C_x = \{(regex, \omega) \mid regex \text{ matches } x,\ 0 < \omega < 1\}$ contain multiple regular expressions $regex$ describing the statistical data element $x$ of the class, each accompanied by a weighting $\omega$.

For each pattern class $C_x$, we compute an average weighting for every schema element $S_M$ and $S_N$. This average weighting indicates how often instances of the schema element can be expressed by a pattern of the class. As soon as an instance can be expressed by a $(regex, \omega) \in C_x$, the value of $\omega$ is added to the sum of all weightings whose regular expressions previously matched another instance of the same schema element, resulting in the final $\sum \omega$. The average is then obtained by normalizing this sum by the total number of instances in this particular schema element. For each $S_M$, this is

$$avg(S_M) = \frac{\sum \omega}{|\text{Instances in } S_M|}.$$

For $S_N$ the average is calculated analogously.

If this average weight is not 0, the schema element is collected together with its average weight in a set. We define these sets as $M_x$ and $N_x$ with $M_x = \{(S_M, avg(S_M))\}$ and $N_x = \{(S_N, avg(S_N))\}$. The Cartesian product of $M_x$ and $N_x$ is computed and added to $Matches_x$, in which a triple $(S_M, S_N, avg(S_M) \cdot avg(S_N))$ defines a match between an $S_M$ and an $S_N$ with the probability $avg(S_M) \cdot avg(S_N)$.
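The computation of the average weightings and of the match triples can be illustrated with a minimal Python sketch. It is not the Java implementation referenced in this paper. The names `DATE_CLASS`, `average_weighting` and `match_datasets`, the weightings 0.9, 0.7 and 0.5, and the sample datasets are all hypothetical. The sketch further assumes that an instance must match a pattern in full, that only the highest-weighted matching pattern contributes to the sum, and that the periods in the date pattern are meant literally.

```python
import re
from itertools import product

# Hypothetical pattern class C_x for the data element "date":
# (regex, weighting) pairs sorted by weighting in descending order.
# The weightings are illustrative assumptions, not values from the paper.
# Assumption: the periods of the third date pattern are escaped so that
# they match a literal "." only.
DATE_CLASS = [
    (r"[0-9]{2}\.[0-9]{2}\.[0-9]{4}", 0.9),
    (r"[0-9]{2}-[0-9]{4}", 0.7),
    (r"[0-9]{4}", 0.5),
]


def average_weighting(instances, pattern_class):
    """Return the sum of matched weightings divided by the number of instances."""
    if not instances:
        return 0.0
    total = 0.0
    for value in instances:
        for regex, weight in pattern_class:
            # Assumption: the whole instance value has to match the pattern.
            if re.fullmatch(regex, value):
                total += weight
                break  # assumption: only the first (highest-weighted) match counts
    return total / len(instances)


def weighted_elements(dataset, pattern_class):
    """Collect (schema element, average weighting) pairs with a non-zero average."""
    result = []
    for name, instances in dataset.items():
        avg = average_weighting(instances, pattern_class)
        if avg != 0:
            result.append((name, avg))
    return result


def match_datasets(dataset_m, dataset_n, pattern_class):
    """Build (S_M, S_N, avg(S_M) * avg(S_N)) triples from the Cartesian product."""
    m_x = weighted_elements(dataset_m, pattern_class)
    n_x = weighted_elements(dataset_n, pattern_class)
    return [(sm, sn, am * an) for (sm, am), (sn, an) in product(m_x, n_x)]


# Hypothetical datasets: schema element name -> list of instance values.
dataset_m = {"birth_date": ["01.02.1990", "15.11.1985"], "country": ["DEU", "FRA"]}
dataset_n = {"year": ["1990", "2001"], "sex": ["m", "f"]}

matches = match_datasets(dataset_m, dataset_n, DATE_CLASS)
print(matches)
```

In this sketch only `birth_date` and `year` obtain a non-zero average for the date class, so they form the single candidate match. Schema elements whose instances match no pattern of the class are dropped before the Cartesian product is built.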
Finally, the result set $Matches_x$ contains all matches between the two datasets $M$ and $N$.

Our approach has been implemented in Java using the JENA API. The source code and an executable jar file are available at https://github.com/mazlo/smurf. In first experiments with real-world statistical data we obtained better results for matching schema elements than other existing matching systems. A detailed evaluation with generic test datasets is currently work in progress. We aim to extend our approach to extract patterns from instance values and to generate weightings automatically. Feature extraction from instance values can enhance our approach in computing weightings and in assigning regular expressions to adequate pattern classes.
Similar resources
Instance-based ontology matching and the evaluation of matching systems
The matching of heterogeneous information sources is a crucial task in many different domains. In order to find relations between the different pieces of information, which are annotated using different structures and formats, matching systems have been developed. In the past two decades, ontologies became more and more important as a way to represent the semantics of information in a machine r...
Regular Expressions with Numerical Occurrence Indicators - preliminary results
Regular expressions with numerical occurrence indicators (#REs) are used in established text manipulation tools like Perl and Unix egrep, and in the recent W3C XML Schema Definition Language. Numerical occurrence indicators do not increase the expressive power of regular expressions, but they do increase the succinctness of expressions by an exponential factor. Therefore methods based on straig...
An Improved Semantic Schema Matching Approach
Schema matching is a critical step in many applications, such as data warehouse loading, Online Analytical Process (OLAP), Data mining, semantic web [2] and schema integration. This task is defined for finding the semantic correspondences between elements of two schemas. Recently, schema matching has found considerable interest in both research and practice. In this paper, we present a new impr...
Deterministic Regular Expressions With Back-References
Most modern libraries for regular expression matching allow back-references (i. e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of ...
Type-Based Optimization for Regular Patterns
Pattern matching mechanisms based on regular expressions feature in a number of recent languages for processing XML. The flexibility of these mechanisms demands novel approaches to the familiar problems of pattern-match compilation—how to minimize the number of tests performed during pattern matching while keeping the size of the output code small. We describe work in progress on a compilation ...